Combination of deep speaker embeddings for diarisation
Authors
Abstract
Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments. To extract better-performing and more robust speaker embeddings, this paper proposes a c-vector method that combines multiple sets of complementary d-vectors derived from systems with different NN components. Three structures are used to implement the c-vectors, namely the 2D self-attentive, gated additive, and bilinear pooling structures, which rely on attention mechanisms, a gating mechanism, and a low-rank bilinear pooling mechanism respectively. Furthermore, a neural-based single-pass speaker diarisation pipeline is also proposed in this paper, which uses NNs to achieve voice activity detection, speaker change point detection, and speaker embedding extraction. Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets, which consist of real meetings with 4–10 speakers and a wide range of acoustic conditions. For systems trained on the AMI training set, relative speaker error rate (SER) reductions of 13% and 29% are obtained by using c-vectors instead of d-vectors on the AMI dev and eval sets respectively, and a relative reduction of 15% in SER is observed on RT05, which shows the robustness of the proposed methods. By incorporating VoxCeleb data into the training set, the best c-vector system achieved relative SER reductions of 7%, 17% and 16% compared to the d-vector system on the AMI dev, AMI eval, and RT05 sets respectively.
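The SER figures in the abstract are relative reductions, meaning the improvement is expressed as a fraction of the baseline (d-vector) error rate rather than as a difference in absolute percentage points. The short Python sketch below illustrates that arithmetic. The function name and the two SER values are hypothetical, chosen only to show the calculation; they are not results reported in the paper.

```python
def relative_reduction(baseline_ser: float, new_ser: float) -> float:
    """Return the relative error-rate reduction as a fraction of the baseline."""
    if baseline_ser <= 0:
        raise ValueError("baseline_ser must be positive")
    return (baseline_ser - new_ser) / baseline_ser


# Hypothetical SER values (in percent), for illustration only; not from the paper.
hypothetical_d_vector_ser = 20.0
hypothetical_c_vector_ser = 17.0

reduction = relative_reduction(hypothetical_d_vector_ser, hypothetical_c_vector_ser)
print(f"Relative SER reduction: {reduction:.0%}")
```

With these illustrative numbers, a drop of 3 absolute points from a baseline of 20 corresponds to a 15% relative reduction.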
Similar resources
Speaker diarisation for broadcast news
It is often important to be able to automatically label ‘who spoke when’ during some audio data. This paper describes two systems for audio segmentation developed at CUED and MIT-LL and evaluates their performance using the speaker diarisation score defined in the 2003 Rich Transcription Evaluation. A new clustering procedure and BIC-based stopping criterion for the CUED system is introduced wh...
DNN-Based Speaker Clustering for Speaker Diarisation
Speaker diarisation, the task of answering “who spoke when?”, is often considered to consist of three independent stages: speech activity detection, speaker segmentation and speaker clustering. These represent the separation of speech and nonspeech, the splitting into speaker homogeneous speech segments, followed by grouping together those which belong to the same speaker. This paper is concern...
Deep Speaker Embeddings for Short-Duration Speaker Verification
The performance of a state-of-the-art speaker verification system is severely degraded when it is presented with trial recordings of short duration. In this work we propose to use deep neural networks to learn short-duration speaker embeddings. We focus on the 5s-5s condition, wherein both sides of a verification trial are 5 seconds long. In our previous work we established that learning a non-...
Audio-visual synchronisation for speaker diarisation
The role of audio–visual speech synchrony for speaker diarisation is investigated on the multiparty meeting domain. We measured both mutual information and canonical correlation on different sets of audio and video features. As acoustic features we considered energy and MFCCs. As visual features we experimented both with motion intensity features, computed on the whole image, and Kanade Lucas T...
An Overview of Automatic Speaker Diarisation Systems
Audio diarisation is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources and other signal source/channel characteristics. Diarisation can be used for helping speech recognition, facilitating the searching...
Journal
Journal title: Neural Networks
Year: 2021
ISSN: 1879-2782, 0893-6080
DOI: https://doi.org/10.1016/j.neunet.2021.04.020